STATGR 5243, Project 1

Finding the Pursuit of Happiness - EDA and NLP Analyses of HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments

"True happiness is to enjoy the present, without anxious dependence upon the future..."
- Lucius Annaeus Seneca
In [229]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Out[229]:
The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

Part I: Which people are behind the Happy Moments corpus?

Before analyzing the happy moments themselves, it is important to identify the demographic information of those individuals who provided the crowd-sourced happy moments. Understanding the demographic information will help us to better understand who exactly is contributing to the corpus of moments, and will aid in helping ot make inferences about the data.

We begin by importing the data.

In [230]:
import pandas as pd
df = pd.DataFrame(pd.read_csv('/Users/matthewvitha/Downloads/cleaned_hm.csv'))
df.head(2)
Out[230]:
hmid wid reflection_period original_hm cleaned_hm modified num_sentence ground_truth_category predicted_category
0 27673 2053 24h I went on a successful date with someone I fel... I went on a successful date with someone I fel... True 1 NaN affection
1 27674 2 24h I was happy when my son got 90% marks in his e... I was happy when my son got 90% marks in his e... True 1 NaN affection
In [231]:
df_sense = pd.DataFrame(pd.read_csv('/Users/matthewvitha/Downloads/senselabel (2).csv'))
df_sense.head(2)
Out[231]:
hmid tokenOffset word lowercaseLemma POS MWE offsetParent supersenseLabel
0 31526 1 I i PRON O 0 NaN
1 31526 2 found find VERB O 0 v.cognition
In [232]:
df_demo = pd.DataFrame(pd.read_csv('/Users/matthewvitha/Downloads/demographic (1).csv'))
df_demo.head(2)
Out[232]:
wid age country gender marital parenthood
0 1 37.0 USA m married y
1 2 29.0 IND m married y
In [233]:
print(df_demo.shape)
(10844, 6)

Further, we will remove Null values from the demographic data.

In [234]:
df_demo_nonan = df_demo.dropna(how='any')
In [235]:
df_demo_nonan.shape
Out[235]:
(10689, 6)

Note- I also imported a separate dataframe for plotting age demographcis, as manipulating some of the data was easier to perform in Excel and then re-import in Jupyter Notebook.

In [236]:
df_demo_age = pd.DataFrame(pd.read_csv('/Users/matthewvitha/Downloads/demo_age.csv'))
In [237]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [238]:
df_demo_age = df_demo_age.dropna()
sns.distplot(df_demo_age['age'])
plt.title('Age Distribution')
/Users/matthewvitha/anaconda/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[238]:
Text(0.5,1,'Age Distribution')
In [239]:
df_demo_nonan.head(2)
Out[239]:
wid age country gender marital parenthood
0 1 37.0 USA m married y
1 2 29.0 IND m married y
In [240]:
import pandas_profiling
In [241]:
pandas_profiling.ProfileReport(df_demo_nonan)
Out[241]:

Overview

Dataset info

Number of variables 7
Number of observations 10689
Total Missing (%) 0.0%
Total size in memory 584.6 KiB
Average record size in memory 56.0 B

Variables types

Numeric 1
Categorical 5
Boolean 0
Date 0
Text (Unique) 0
Rejected 1
Unsupported 0

Warnings

  • age has a high cardinality: 138 distinct values Warning
  • country has a high cardinality: 100 distinct values Warning
  • wid is highly correlated with index (ρ = 0.9979) Rejected

Variables

age
Categorical

Distinct count 138
Unique (%) 1.3%
Missing (%) 0.0%
Missing (n) 0
26
 
354
25
 
353
27
 
335
Other values (135)
9647
Value Count Frequency (%)  
26 354 3.3%
 
25 353 3.3%
 
27 335 3.1%
 
29 317 3.0%
 
30 315 2.9%
 
24 299 2.8%
 
28 297 2.8%
 
31 280 2.6%
 
32 272 2.5%
 
23 262 2.5%
 
Other values (128) 7605 71.1%
 

country
Categorical

Distinct count 100
Unique (%) 0.9%
Missing (%) 0.0%
Missing (n) 0
USA
9203
IND
 
957
CAN
 
64
Other values (97)
 
465
Value Count Frequency (%)  
USA 9203 86.1%
 
IND 957 9.0%
 
CAN 64 0.6%
 
VEN 54 0.5%
 
GBR 48 0.4%
 
PHL 32 0.3%
 
MEX 22 0.2%
 
BRA 14 0.1%
 
AUS 14 0.1%
 
NGA 12 0.1%
 
Other values (90) 269 2.5%
 

gender
Categorical

Distinct count 3
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
f
5372
m
5262
o
 
55
Value Count Frequency (%)  
f 5372 50.3%
 
m 5262 49.2%
 
o 55 0.5%
 

index
Numeric

Distinct count 10689
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 5391.3
Minimum 0
Maximum 10843
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 534.4
Q1 2672
Median 5368
Q3 8106
95-th percentile 10299
Maximum 10843
Range 10843
Interquartile range 5434

Descriptive statistics

Standard deviation 3134.3
Coef of variation 0.58137
Kurtosis -1.2034
Mean 5391.3
MAD 2715.9
Skewness 0.014099
Sum 57627909
Variance 9824100
Memory size 83.6 KiB
Value Count Frequency (%)  
2047 1 0.0%
 
5440 1 0.0%
 
3387 1 0.0%
 
1338 1 0.0%
 
7481 1 0.0%
 
5432 1 0.0%
 
9526 1 0.0%
 
3379 1 0.0%
 
1330 1 0.0%
 
7473 1 0.0%
 
Other values (10679) 10679 99.9%
 

Minimum 5 values

Value Count Frequency (%)  
0 1 0.0%
 
1 1 0.0%
 
2 1 0.0%
 
3 1 0.0%
 
4 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
10839 1 0.0%
 
10840 1 0.0%
 
10841 1 0.0%
 
10842 1 0.0%
 
10843 1 0.0%
 

marital
Categorical

Distinct count 5
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
single
5637
married
4328
divorced
 
555
Other values (2)
 
169
Value Count Frequency (%)  
single 5637 52.7%
 
married 4328 40.5%
 
divorced 555 5.2%
 
separated 101 0.9%
 
widowed 68 0.6%
 

parenthood
Categorical

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
n
6382
y
4307
Value Count Frequency (%)  
n 6382 59.7%
 
y 4307 40.3%
 

wid
Highly correlated

This variable is highly correlated with index and should be ignored for analysis

Correlation 0.9979

Correlations

Sample

wid age country gender marital parenthood
0 1 37.0 USA m married y
1 2 29.0 IND m married y
2 3 25 IND m single n
3 4 32 USA m married y
4 5 29 USA m married y

By using the Pandas Profiling tool, we ascertain a much better understanding of the demographics of those individuals putting forth their happy moments. Key take aways are as follows:

A. Slightly right skewed age distribution, with the mean centering around 25.
B. Overwhelming majority of the individuals are American (~86%), followed by individuals who are Indian (9%).
C. Almost 50-50 split between males and females.
D. Slightly above 50% of those individuals sampled are single, followed by ~40% of individuals who are married, lastly followed by ~5% of individuals who are divorced.
E. ~60% of those sampled do not have children, while ~40% of those sampled do have children.

Part II: What observations can we make about individuals' happy moments from a peripheral level?

Now that we have an understanding of those individuals whose happy moments comprise our corpus, it would be beneficial to identify certain facts about the happy moments themselves.

The other two datasets provided in the analysis can help provide us with this information.

In [242]:
df['hm_length']  = df['cleaned_hm'].str.len()
sns.distplot(df['hm_length'])
plt.title('Length of Happy Moment - Distribution')
/Users/matthewvitha/anaconda/lib/python3.5/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[242]:
Text(0.5,1,'Length of Happy Moment - Distribution')
In [243]:
sns.set(style="whitegrid")
ax = sns.boxplot(x="hm_length", hue="reflection_period",data=df, palette="Set3")

From the above two plots, we see that the majority of happy moments are under 1000 words, closer to 0 words than 1000. There remain a several outliers that persist in the 2000-6000 word range.

In [244]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

f, ax = plt.subplots(figsize=(10, 5))
ax = sns.countplot(x="predicted_category", data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
ax.set_axisbelow(True)
plt.tight_layout()
plt.title("Frequency of Predicted Category Labels")
plt.grid(True, color='k', linestyle='-', linewidth=2)
plt.show()

We can make another interesting observation from the 'predicted_category' field from the cleaned_hm dataset (here noted as df). We see here that 'affection' and 'achievement' are by far the most predicted categories. These two are followed by 'bonding', 'enjoy_the_moment', and 'leisure'.

'nature' and 'exercise' by comparison, are the least frequent predicted category label.
In [245]:
sns.catplot(x="hm_length", y="predicted_category",kind='box',hue='reflection_period', palette="Blues",data=df)
figsize=(50,50)
plt.title('Length of Happy Moment by Predicted Category')
Out[245]:
Text(0.5,1,'Length of Happy Moment by Predicted Category')

Analyzing the frequency of 'predicted_category' provides us with interesting observations. We can add a second layer to this analysis by comparing the frequency of 'predicted_category' with the length of the happy moments.

While the length of 'affection' matches that of what we would expect based on the previous chart, we notice that the length of achievement seems small compared to its frequency in the data. The length of 'achievement' is highly similar to that of 'enjoy_the_moment' and 'bonding'.

In [246]:
ax = sns.countplot(x="POS", data=df_sense)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.grid(True, color='k', linestyle='-', linewidth=2)
plt.title('Frequency of Different POS')
Out[246]:
Text(0.5,1,'Frequency of Different POS')

From the above plot, we can see the frequencies of different POS throughout the dataset. Perhaps not surprisingly, 'Noun', 'Verb', and 'Pron' are the most frequent POS. This implies that people are happy when they interact with a state or object, or when they interact with other people involving states or objects.

In [247]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

f, ax = plt.subplots(figsize=(10, 5))
ax = sns.countplot(x="supersenseLabel", data=df_sense)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.title("Frequency of Different Super_sense Labels")
plt.grid(True, color='k', linestyle='-', linewidth=2)
plt.show()

Our observation from the above POS_histogram are confirmed by the supersenseLabel histogram depicted above. We see that 'v.stative', 'n.people', and 'n.time' contain the highest counts among all supersense labels. Moreover, we can infer that the most frequent happy moments involve spending time with other people aside from one's self, and that these actions entail a state of being/mind, rather than actions.

By looking at the plot however, we can also see that 'v.motion' and 'n.act' are present as well, allowing us to conclude that these two are also non-trivial in comprising happy moments.

Now that we have analyzed the demographics behind the people who provided their happy moments, and the actual happy moments themselves, we can try to algorithmically identify which happy moments are grouped together.

To accomplish this task, we implement the below algorithms:

A. TF_IDF Vectorization

B. K_Means Clustering

C. LDA Topic Modeling

Before we implement algorithms to analyze the textual data comprising the happy moments, we must first pre-process the text data. Pre-processing steps entail:

A. Turning all letters to lowercase

B. Tokenizing the happy moments

C. Removing stop words from the happy moments

D. Lemmatizing the happy moments

In [248]:
import pandas as pd
df = pd.DataFrame(pd.read_csv('/Users/matthewvitha/Downloads/cleaned_hm.csv'))
In [249]:
import string
df['cleaned_hm'] = [i.lower() for i in df['cleaned_hm']]
#new_list = [expression(i) for i in old_list if filter(i)]
In [250]:
df['cleaned_hm'] = [i.translate(string.punctuation) for i in df['cleaned_hm']]
In [251]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re

def preprocess(sentence):
	sentence = sentence.lower()
	tokenizer = RegexpTokenizer(r'\w+')
	tokens = tokenizer.tokenize(sentence)
	filtered_words = [w for w in tokens if not w in stopwords.words('english')]
	return " ".join(filtered_words)
In [326]:
df_small = df[:]
In [327]:
df_small['cleaned_hm'] = [preprocess(x) for x in df_small['cleaned_hm']]
In [328]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()
df_small['cleaned_hm'] = [lemmatizer.lemmatize(word) for word in df_small['cleaned_hm']]
In [329]:
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
lemma=nltk.stem.WordNetLemmatizer()

lemma_books = []
for book in df_small['cleaned_hm']:
    lemma_book = [lemma.lemmatize(word) for word in wordpunct_tokenize(book)]
    lemma_book = (' ').join(lemma_book)
    lemma_books.append(lemma_book)

Now that the textual data has been preprocessed, we can proceed with our TF_IDF vectorization of the corpus. We will use sklearn's TfidfVectorizer to accomplish this task.

Note: Grid searching for optimal parameters of these algorithms was out_of_scope for this study.

from nltk.tokenize import RegexpTokenizer from nltk.tokenize import word_tokenize from sklearn.feature_extraction.text import TfidfVectorizer vectorizer = TfidfVectorizer(min_df=10, max_features=180000, tokenizer=word_tokenize, ngram_range=(1, 2))
In [330]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(lemma_books).toarray()
vocab = np.array(vectorizer.get_feature_names())
dtm.shape
Out[330]:
(100535, 22233)

By creating a dictionary of the vectorized featre names and vectorizer.idf, we can create a data frame that shows different scores from the vectorization.

Not surprisingly, 'happy', 'got', 'went', 'made', and 'friend', were all some of the most frequently accounted for words in the corpus' vocabulary. The frequency of these words contrasts with 'rainbow' and 'tulip', which were used much more infrequently. This seems appropriate given that these nouns do not occur frequently in everyday life.

In [331]:
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
tfidf = pd.DataFrame(columns=['tfidf']).from_dict(
                    dict(tfidf), orient='index')
tfidf.columns = ['tfidf']
In [332]:
tfidf.sort_values(by=['tfidf'], ascending=True).head(10)
Out[332]:
tfidf
happy 2.782197
got 3.055617
made 3.235052
friend 3.291565
went 3.384920
time 3.445470
day 3.456779
new 3.478126
work 3.604990
last 3.788712
In [333]:
tfidf.sort_values(by=['tfidf'], ascending=False).head(10)
Out[333]:
tfidf
dyslexic 11.825124
haan 11.825124
esque 11.825124
burdwan 11.825124
ecig 11.825124
politically 11.825124
hematologist 11.825124
ethereal 11.825124
sisteras 11.825124
ethan 11.825124
In [334]:
from sklearn.decomposition import TruncatedSVD
In [335]:
n_comp=7
In [336]:
vz_sample = vectorizer.fit_transform(list(lemma_books))

We can plot our vectorized corpus using the Truncated SVD and TSNE models (set to two components for a 2-d view). 7 components were chosen given that there are 7 predicted cateogry labels.

In [337]:
svd = TruncatedSVD(n_components=n_comp, random_state=42)
svd_tfidf = svd.fit_transform(vz_sample)
In [338]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=42, n_iter=250)
In [339]:
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 100535 samples in 0.117s...
[t-SNE] Computed neighbors for 100535 samples in 22.622s...
[t-SNE] Computed conditional probabilities for sample 1000 / 100535
[t-SNE] Computed conditional probabilities for sample 2000 / 100535
[t-SNE] Computed conditional probabilities for sample 3000 / 100535
[t-SNE] Computed conditional probabilities for sample 4000 / 100535
[t-SNE] Computed conditional probabilities for sample 5000 / 100535
[t-SNE] Computed conditional probabilities for sample 6000 / 100535
[t-SNE] Computed conditional probabilities for sample 7000 / 100535
[t-SNE] Computed conditional probabilities for sample 8000 / 100535
[t-SNE] Computed conditional probabilities for sample 9000 / 100535
[t-SNE] Computed conditional probabilities for sample 10000 / 100535
[t-SNE] Computed conditional probabilities for sample 11000 / 100535
[t-SNE] Computed conditional probabilities for sample 12000 / 100535
[t-SNE] Computed conditional probabilities for sample 13000 / 100535
[t-SNE] Computed conditional probabilities for sample 14000 / 100535
[t-SNE] Computed conditional probabilities for sample 15000 / 100535
[t-SNE] Computed conditional probabilities for sample 16000 / 100535
[t-SNE] Computed conditional probabilities for sample 17000 / 100535
[t-SNE] Computed conditional probabilities for sample 18000 / 100535
[t-SNE] Computed conditional probabilities for sample 19000 / 100535
[t-SNE] Computed conditional probabilities for sample 20000 / 100535
[t-SNE] Computed conditional probabilities for sample 21000 / 100535
[t-SNE] Computed conditional probabilities for sample 22000 / 100535
[t-SNE] Computed conditional probabilities for sample 23000 / 100535
[t-SNE] Computed conditional probabilities for sample 24000 / 100535
[t-SNE] Computed conditional probabilities for sample 25000 / 100535
[t-SNE] Computed conditional probabilities for sample 26000 / 100535
[t-SNE] Computed conditional probabilities for sample 27000 / 100535
[t-SNE] Computed conditional probabilities for sample 28000 / 100535
[t-SNE] Computed conditional probabilities for sample 29000 / 100535
[t-SNE] Computed conditional probabilities for sample 30000 / 100535
[t-SNE] Computed conditional probabilities for sample 31000 / 100535
[t-SNE] Computed conditional probabilities for sample 32000 / 100535
[t-SNE] Computed conditional probabilities for sample 33000 / 100535
[t-SNE] Computed conditional probabilities for sample 34000 / 100535
[t-SNE] Computed conditional probabilities for sample 35000 / 100535
[t-SNE] Computed conditional probabilities for sample 36000 / 100535
[t-SNE] Computed conditional probabilities for sample 37000 / 100535
[t-SNE] Computed conditional probabilities for sample 38000 / 100535
[t-SNE] Computed conditional probabilities for sample 39000 / 100535
[t-SNE] Computed conditional probabilities for sample 40000 / 100535
[t-SNE] Computed conditional probabilities for sample 41000 / 100535
[t-SNE] Computed conditional probabilities for sample 42000 / 100535
[t-SNE] Computed conditional probabilities for sample 43000 / 100535
[t-SNE] Computed conditional probabilities for sample 44000 / 100535
[t-SNE] Computed conditional probabilities for sample 45000 / 100535
[t-SNE] Computed conditional probabilities for sample 46000 / 100535
[t-SNE] Computed conditional probabilities for sample 47000 / 100535
[t-SNE] Computed conditional probabilities for sample 48000 / 100535
[t-SNE] Computed conditional probabilities for sample 49000 / 100535
[t-SNE] Computed conditional probabilities for sample 50000 / 100535
[t-SNE] Computed conditional probabilities for sample 51000 / 100535
[t-SNE] Computed conditional probabilities for sample 52000 / 100535
[t-SNE] Computed conditional probabilities for sample 53000 / 100535
[t-SNE] Computed conditional probabilities for sample 54000 / 100535
[t-SNE] Computed conditional probabilities for sample 55000 / 100535
[t-SNE] Computed conditional probabilities for sample 56000 / 100535
[t-SNE] Computed conditional probabilities for sample 57000 / 100535
[t-SNE] Computed conditional probabilities for sample 58000 / 100535
[t-SNE] Computed conditional probabilities for sample 59000 / 100535
[t-SNE] Computed conditional probabilities for sample 60000 / 100535
[t-SNE] Computed conditional probabilities for sample 61000 / 100535
[t-SNE] Computed conditional probabilities for sample 62000 / 100535
[t-SNE] Computed conditional probabilities for sample 63000 / 100535
[t-SNE] Computed conditional probabilities for sample 64000 / 100535
[t-SNE] Computed conditional probabilities for sample 65000 / 100535
[t-SNE] Computed conditional probabilities for sample 66000 / 100535
[t-SNE] Computed conditional probabilities for sample 67000 / 100535
[t-SNE] Computed conditional probabilities for sample 68000 / 100535
[t-SNE] Computed conditional probabilities for sample 69000 / 100535
[t-SNE] Computed conditional probabilities for sample 70000 / 100535
[t-SNE] Computed conditional probabilities for sample 71000 / 100535
[t-SNE] Computed conditional probabilities for sample 72000 / 100535
[t-SNE] Computed conditional probabilities for sample 73000 / 100535
[t-SNE] Computed conditional probabilities for sample 74000 / 100535
[t-SNE] Computed conditional probabilities for sample 75000 / 100535
[t-SNE] Computed conditional probabilities for sample 76000 / 100535
[t-SNE] Computed conditional probabilities for sample 77000 / 100535
[t-SNE] Computed conditional probabilities for sample 78000 / 100535
[t-SNE] Computed conditional probabilities for sample 79000 / 100535
[t-SNE] Computed conditional probabilities for sample 80000 / 100535
[t-SNE] Computed conditional probabilities for sample 81000 / 100535
[t-SNE] Computed conditional probabilities for sample 82000 / 100535
[t-SNE] Computed conditional probabilities for sample 83000 / 100535
[t-SNE] Computed conditional probabilities for sample 84000 / 100535
[t-SNE] Computed conditional probabilities for sample 85000 / 100535
[t-SNE] Computed conditional probabilities for sample 86000 / 100535
[t-SNE] Computed conditional probabilities for sample 87000 / 100535
[t-SNE] Computed conditional probabilities for sample 88000 / 100535
[t-SNE] Computed conditional probabilities for sample 89000 / 100535
[t-SNE] Computed conditional probabilities for sample 90000 / 100535
[t-SNE] Computed conditional probabilities for sample 91000 / 100535
[t-SNE] Computed conditional probabilities for sample 92000 / 100535
[t-SNE] Computed conditional probabilities for sample 93000 / 100535
[t-SNE] Computed conditional probabilities for sample 94000 / 100535
[t-SNE] Computed conditional probabilities for sample 95000 / 100535
[t-SNE] Computed conditional probabilities for sample 96000 / 100535
[t-SNE] Computed conditional probabilities for sample 97000 / 100535
[t-SNE] Computed conditional probabilities for sample 98000 / 100535
[t-SNE] Computed conditional probabilities for sample 99000 / 100535
[t-SNE] Computed conditional probabilities for sample 100000 / 100535
[t-SNE] Computed conditional probabilities for sample 100535 / 100535
[t-SNE] Mean sigma: 0.000000
[t-SNE] KL divergence after 250 iterations with early exaggeration: 96.575554
[t-SNE] Error after 251 iterations: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
In [340]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.palettes import d3
import bokeh.models as bmo
from bokeh.io import save, output_file
output_notebook()

plot_tfidf = bp.figure(plot_width=700, plot_height=600,
                       title="tf-idf clustering of the item description",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)
Loading BokehJS ...
In [341]:
tfidf_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
In [342]:
tfidf_df['description'] = lemma_books
In [343]:
plot_tfidf.scatter(x='x', y='y', source=tfidf_df, alpha=0.7)
hover = plot_tfidf.select(dict(type=HoverTool))
hover.tooltips={"description": "@description"}
show(plot_tfidf)

Using the bokeh library, we can develop a scatter plot visualization showing each happy moment and it's description. The Tf_idf algorithm does an adequate job in separating each happy moment into clusters.

But can we do better? The below cells detail the use of the MiniBatchKmeans algorithm to cluster our happy moments together. This time, 13 clusters are used as the results were semantically better than compared to using 7 clusters used previously.

In [344]:
from sklearn.cluster import MiniBatchKMeans

num_clusters = 13 # need to be selected wisely
kmeans_model = MiniBatchKMeans(n_clusters=num_clusters,
                               init='k-means++',
                               n_init=1,
                               init_size=1000, batch_size=1000, verbose=0, max_iter=250)
In [345]:
kmeans = kmeans_model.fit(vz_sample)
kmeans_clusters = kmeans.predict(vz_sample)
kmeans_distances = kmeans.transform(vz_sample)
# reduce dimension to 2 using tsne
tsne_kmeans = tsne_model.fit_transform(kmeans_distances)
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 100535 samples in 0.251s...
[t-SNE] Computed neighbors for 100535 samples in 120.761s...
[t-SNE] Computed conditional probabilities for sample 1000 / 100535
[t-SNE] Computed conditional probabilities for sample 2000 / 100535
[t-SNE] Computed conditional probabilities for sample 3000 / 100535
[t-SNE] Computed conditional probabilities for sample 4000 / 100535
[t-SNE] Computed conditional probabilities for sample 5000 / 100535
[t-SNE] Computed conditional probabilities for sample 6000 / 100535
[t-SNE] Computed conditional probabilities for sample 7000 / 100535
[t-SNE] Computed conditional probabilities for sample 8000 / 100535
[t-SNE] Computed conditional probabilities for sample 9000 / 100535
[t-SNE] Computed conditional probabilities for sample 10000 / 100535
[t-SNE] Computed conditional probabilities for sample 11000 / 100535
[t-SNE] Computed conditional probabilities for sample 12000 / 100535
[t-SNE] Computed conditional probabilities for sample 13000 / 100535
[t-SNE] Computed conditional probabilities for sample 14000 / 100535
[t-SNE] Computed conditional probabilities for sample 15000 / 100535
[t-SNE] Computed conditional probabilities for sample 16000 / 100535
[t-SNE] Computed conditional probabilities for sample 17000 / 100535
[t-SNE] Computed conditional probabilities for sample 18000 / 100535
[t-SNE] Computed conditional probabilities for sample 19000 / 100535
[t-SNE] Computed conditional probabilities for sample 20000 / 100535
[t-SNE] Computed conditional probabilities for sample 21000 / 100535
[t-SNE] Computed conditional probabilities for sample 22000 / 100535
[t-SNE] Computed conditional probabilities for sample 23000 / 100535
[t-SNE] Computed conditional probabilities for sample 24000 / 100535
[t-SNE] Computed conditional probabilities for sample 25000 / 100535
[t-SNE] Computed conditional probabilities for sample 26000 / 100535
[t-SNE] Computed conditional probabilities for sample 27000 / 100535
[t-SNE] Computed conditional probabilities for sample 28000 / 100535
[t-SNE] Computed conditional probabilities for sample 29000 / 100535
[t-SNE] Computed conditional probabilities for sample 30000 / 100535
[t-SNE] Computed conditional probabilities for sample 31000 / 100535
[t-SNE] Computed conditional probabilities for sample 32000 / 100535
[t-SNE] Computed conditional probabilities for sample 33000 / 100535
[t-SNE] Computed conditional probabilities for sample 34000 / 100535
[t-SNE] Computed conditional probabilities for sample 35000 / 100535
[t-SNE] Computed conditional probabilities for sample 36000 / 100535
[t-SNE] Computed conditional probabilities for sample 37000 / 100535
[t-SNE] Computed conditional probabilities for sample 38000 / 100535
[t-SNE] Computed conditional probabilities for sample 39000 / 100535
[t-SNE] Computed conditional probabilities for sample 40000 / 100535
[t-SNE] Computed conditional probabilities for sample 41000 / 100535
[t-SNE] Computed conditional probabilities for sample 42000 / 100535
[t-SNE] Computed conditional probabilities for sample 43000 / 100535
[t-SNE] Computed conditional probabilities for sample 44000 / 100535
[t-SNE] Computed conditional probabilities for sample 45000 / 100535
[t-SNE] Computed conditional probabilities for sample 46000 / 100535
[t-SNE] Computed conditional probabilities for sample 47000 / 100535
[t-SNE] Computed conditional probabilities for sample 48000 / 100535
[t-SNE] Computed conditional probabilities for sample 49000 / 100535
[t-SNE] Computed conditional probabilities for sample 50000 / 100535
[t-SNE] Computed conditional probabilities for sample 51000 / 100535
[t-SNE] Computed conditional probabilities for sample 52000 / 100535
[t-SNE] Computed conditional probabilities for sample 53000 / 100535
[t-SNE] Computed conditional probabilities for sample 54000 / 100535
[t-SNE] Computed conditional probabilities for sample 55000 / 100535
[t-SNE] Computed conditional probabilities for sample 56000 / 100535
[t-SNE] Computed conditional probabilities for sample 57000 / 100535
[t-SNE] Computed conditional probabilities for sample 58000 / 100535
[t-SNE] Computed conditional probabilities for sample 59000 / 100535
[t-SNE] Computed conditional probabilities for sample 60000 / 100535
[t-SNE] Computed conditional probabilities for sample 61000 / 100535
[t-SNE] Computed conditional probabilities for sample 62000 / 100535
[t-SNE] Computed conditional probabilities for sample 63000 / 100535
[t-SNE] Computed conditional probabilities for sample 64000 / 100535
[t-SNE] Computed conditional probabilities for sample 65000 / 100535
[t-SNE] Computed conditional probabilities for sample 66000 / 100535
[t-SNE] Computed conditional probabilities for sample 67000 / 100535
[t-SNE] Computed conditional probabilities for sample 68000 / 100535
[t-SNE] Computed conditional probabilities for sample 69000 / 100535
[t-SNE] Computed conditional probabilities for sample 70000 / 100535
[t-SNE] Computed conditional probabilities for sample 71000 / 100535
[t-SNE] Computed conditional probabilities for sample 72000 / 100535
[t-SNE] Computed conditional probabilities for sample 73000 / 100535
[t-SNE] Computed conditional probabilities for sample 74000 / 100535
[t-SNE] Computed conditional probabilities for sample 75000 / 100535
[t-SNE] Computed conditional probabilities for sample 76000 / 100535
[t-SNE] Computed conditional probabilities for sample 77000 / 100535
[t-SNE] Computed conditional probabilities for sample 78000 / 100535
[t-SNE] Computed conditional probabilities for sample 79000 / 100535
[t-SNE] Computed conditional probabilities for sample 80000 / 100535
[t-SNE] Computed conditional probabilities for sample 81000 / 100535
[t-SNE] Computed conditional probabilities for sample 82000 / 100535
[t-SNE] Computed conditional probabilities for sample 83000 / 100535
[t-SNE] Computed conditional probabilities for sample 84000 / 100535
[t-SNE] Computed conditional probabilities for sample 85000 / 100535
[t-SNE] Computed conditional probabilities for sample 86000 / 100535
[t-SNE] Computed conditional probabilities for sample 87000 / 100535
[t-SNE] Computed conditional probabilities for sample 88000 / 100535
[t-SNE] Computed conditional probabilities for sample 89000 / 100535
[t-SNE] Computed conditional probabilities for sample 90000 / 100535
[t-SNE] Computed conditional probabilities for sample 91000 / 100535
[t-SNE] Computed conditional probabilities for sample 92000 / 100535
[t-SNE] Computed conditional probabilities for sample 93000 / 100535
[t-SNE] Computed conditional probabilities for sample 94000 / 100535
[t-SNE] Computed conditional probabilities for sample 95000 / 100535
[t-SNE] Computed conditional probabilities for sample 96000 / 100535
[t-SNE] Computed conditional probabilities for sample 97000 / 100535
[t-SNE] Computed conditional probabilities for sample 98000 / 100535
[t-SNE] Computed conditional probabilities for sample 99000 / 100535
[t-SNE] Computed conditional probabilities for sample 100000 / 100535
[t-SNE] Computed conditional probabilities for sample 100535 / 100535
[t-SNE] Mean sigma: 0.000000
[t-SNE] KL divergence after 250 iterations with early exaggeration: 95.268776
[t-SNE] Error after 251 iterations: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
In [346]:
#combined_sample.reset_index(drop=True, inplace=True)
kmeans_df = pd.DataFrame(tsne_kmeans, columns=['x', 'y'])
kmeans_df['cluster'] = kmeans_clusters
kmeans_df['description'] = tfidf_df['description']

#kmeans_df['cluster']=kmeans_df.cluster.astype(str).astype('category')
In [347]:
plot_kmeans = bp.figure(plot_width=700, plot_height=600,
                        title="KMeans clustering of the description",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)
import numpy as np import matplotlib.pyplot as plt from matplotlib import cm from bokeh.models import HoverTool, DatetimeTickFormatter,ColumnDataSource source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'], color=colormap[kmeans_clusters], description=kmeans_df['description'], cluster=kmeans_df['cluster'])) plot_kmeans.scatter(x='x', y='y', color='color', source=source) hover = plot_kmeans.select(dict(type=HoverTool)) hover.tooltips={"description": "@description"} show(plot_kmeans)
In [348]:
label_color_map = {0:'lightgrey',
                1:'lightcoral',
                2:'sandybrown',
                3:'papayawhip',
                4:'lemonchiffon',
                5:'darkkhaki',
                6:'yellow',
                7:'greenyellow',
                8:'lightgreen',
                9:'aquamarine',
                10:'darkkhaki',
                11:'deepskyblue',
                12:'dodgerblue',
                13:'navy',
                14:'blueviolet'}
In [349]:
label_color = [label_color_map[l] for l in kmeans_model.labels_] 
In [350]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from bokeh.models import HoverTool, DatetimeTickFormatter,ColumnDataSource
source = ColumnDataSource(data=dict(x=kmeans_df['x'], y=kmeans_df['y'],
                                    #color=colormap[kmeans_clusters],
                                    color=label_color,
                                    description=kmeans_df['description'],
                                    cluster=kmeans_df['cluster']))

plot_kmeans.scatter(x='x', y='y', color='color', source=source)
hover = plot_kmeans.select(dict(type=HoverTool))
hover.tooltips={"description": "@description"}
show(plot_kmeans)

We see that the Kmeans algorithm does a much better job in clearly separating the happy moments from those that are dissimilar to it. The clusters are color coded and are clearly separable.

By looking at the descriptions, we can observe that the Kmeans algorithm significantly takes into account verbs when clustering the data. i.e. made, went.

Below, we see words associated with the first 12 clusters.

In [351]:
common_words = kmeans_model.cluster_centers_.argsort()[:,-1:-11:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(vocab[word] for word in centroid))
0 : work, found, happy, today, daughter, received, week, dinner, last, favorite
1 : went, shopping, movie, friend, temple, family, dinner, walk, see, restaurant
2 : birthday, party, friend, great, celebrated, family, celebrate, gift, happy, surprise
3 : moment, life, happiest, happy, day, month, family, friend, last, one
4 : good, really, got, friend, night, made, dinner, work, ate, happy
5 : home, came, work, got, day, happy, husband, brought, made, wife
6 : friend, day, old, best, got, last, happy, met, school, mother
7 : made, happy, event, month, dinner, last, today, past, work, feel
8 : got, new, job, bought, work, car, happy, promotion, today, raise
9 : able, get, work, son, time, go, happy, day, sleep, friend
10 : time, long, first, spent, friend, got, spend, happy, family, year
11 : game, video, played, playing, play, friend, baseball, team, new, basketball
12 : dog, walk, took, got, happy, park, played, went, playing, new

Lastly, we can implement a LDA topic model to see if it performs better than our Kmeans clustering.

For this analysis, 9 topics were chosen as this number generated more clearly seprable topics than 7 as used in Tf_idf, and 13 as used in Kmeans.

In [352]:
from sklearn.decomposition import LatentDirichletAllocation
In [353]:
from sklearn.feature_extraction.text import CountVectorizer
# LDA can only use raw term counts for LDA because it is a probabilistic graphical model

no_features = 1000

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(lemma_books)
tf_feature_names = tf_vectorizer.get_feature_names()
In [354]:
no_topics = 9

lda = LatentDirichletAllocation(n_topics=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
/Users/matthewvitha/anaconda/lib/python3.5/site-packages/sklearn/decomposition/online_lda.py:294: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21
  DeprecationWarning)
In [355]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(topic_idx)
        print([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]])
In [356]:
no_top_words = 10
display_topics(lda, tf_feature_names, no_top_words)
0
['daughter', 'morning', 'husband', 'received', 'gave', 'little', 'big', 'early', 'today', 'company']
1
['week', 'month', 'new', 'getting', 'really', 'good', 'happy', 'got', 'ago', 'played']
2
['went', 'time', 'dinner', 'night', 'family', 'nice', 'birthday', 'dog', 'friend', 'took']
3
['car', 'love', 'thing', 'started', 'time', 'like', 'food', 'got', 'seeing', 'free']
4
['day', 'happy', 'son', 'home', 'going', 'felt', 'mother', 'work', 'came', 'got']
5
['new', 'got', 'job', 'bought', 'wife', 'event', 'work', 'told', 'trip', 'month']
6
['able', 'work', 'finally', 'happy', 'hour', 'money', 'past', 'finished', 'working', 'time']
7
['friend', 'old', 'year', 'favorite', 'game', 'happy', 'got', 'went', 'school', 'best']
8
['happy', 'moment', 'life', 'feel', 'house', 'make', 'saw', 'time', 'happiest', 'good']
In [357]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
/Users/matthewvitha/anaconda/lib/python3.5/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
Out[357]:

As seen in the above interactive bubble plot, the LDA model performs well in separating the happy moments into distinct topics.

However, given that the model has a difficult time clearly separating topics 1-3, which comprise the largest percentage of the vocabulary in the corpus, it is safe to say that the Kmeans clustering performs the best in terms of grouping happy moments together.

In [358]:
from wordcloud import WordCloud, STOPWORDS 
import matplotlib.pyplot as plt 
from subprocess import check_output
In [359]:
df_small_a = df_small
cleaned_hm_a = [str(i) for i in df_small_a['cleaned_hm']]
In [360]:
letters_only = re.sub("[^a-zA-Z]",  # Search for all non-letters
                          " ",          # Replace all non-letters with spaces
                          str(df_small_a['cleaned_hm']))
In [361]:
wordcloud = WordCloud(width = 15000, height = 2000, 
                background_color ='grey',max_words=100,  
                min_font_size = 10).generate(letters_only)

The word cloud below helps us to close out thisn project! Thanks for your time and attention!

In [362]:
print(wordcloud)
fig = plt.figure(1)
fig.set_size_inches(15.5, 7.5)
#fig.savefig('test2png.png', dpi=100)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
<wordcloud.wordcloud.WordCloud object at 0x1a2fa23da0>